training system
EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine
There has been significant progress in developing reinforcement learning (RL) training systems. Past works such as IMPALA, Apex, Seed RL, Sample Factory, and others aim to improve the system's overall throughput. In this paper, we aim to address a common bottleneck in the RL training system, i.e., parallel environment execution, which is often the slowest part of the whole system but receives little attention. With a curated design for parallelizing RL environments, we have improved the RL environment simulation speed across different hardware setups, ranging from a laptop and a modest workstation to a high-end machine such as NVIDIA DGX-A100. On a high-end machine, EnvPool achieves one million frames per second for the environment execution on Atari environments and three million frames per second on MuJoCo environments. When running EnvPool on a laptop, the speed is 2.8x that of the Python subprocess. Moreover, great compatibility with existing RL training libraries has been demonstrated in the open-source community, including CleanRL, rl_games, DeepMind Acme, etc. Finally, EnvPool allows researchers to iterate on their ideas at a much faster pace and has great potential to become the de facto RL environment execution engine. Example runs show that it only takes five minutes to train agents to play Atari Pong and MuJoCo Ant on a laptop. EnvPool is open-sourced at https://github.com/sail-sg/envpool.
RL in the Wild: Characterizing RLVR Training in LLM Deployment
Zhou, Jiecheng, Hu, Qinghao, Jin, Yuyang, Wang, Zerui, Sun, Peng, Gu, Yuzhe, Zhang, Wenwei, Zhai, Mingshu, Zhang, Xingcheng, Zhang, Weiming
Large Language Models (LLMs) are now widely used across many domains. With their rapid development, Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months to enhance their reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and there is limited understanding of RLVR from a system perspective. To thoroughly understand the system challenges introduced by RLVR, we present a characterization study of RLVR tasks in our LLM deployment. Specifically, we investigate the distribution and variation trends of workloads across different RL tasks and training steps. We identify issues such as GPU idling caused by skewed sequence length distribution, inefficient parallel strategies in dynamically varying workloads, inefficient data management mechanisms, and load imbalance. We describe our observations and call for further investigation into the remaining open challenges. Furthermore, we propose the PolyTrace benchmark suite for evaluation with realistic workloads; a practical use case validates that the PolyTrace benchmark suite exhibits 94.7% accuracy.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (5 more...)
Probing Experts' Perspectives on AI-Assisted Public Speaking Training
Fourati, Nesrine, Barkar, Alisa, Dragée, Marion, Danthon-Lefebvre, Liv, Chollet, Mathieu
Background: Public speaking is a vital professional skill, yet it remains a source of significant anxiety for many individuals. Traditional training relies heavily on expert coaching, but recent advances in AI have led to novel types of commercial automated public speaking feedback tools. However, most research has focused on prototypes rather than commercial applications, and little is known about how public speaking experts perceive these tools. Objectives: This study aims to evaluate expert opinions on the efficacy and design of commercial AI-based public speaking training tools and to propose guidelines for their improvement. Methods: The research involved 16 semi-structured interviews and 2 focus groups with public speaking experts. Participants discussed their views on current commercial tools, their potential integration into traditional coaching, and suggestions for enhancing these systems. Results and Conclusions: Experts acknowledged the value of AI tools in handling repetitive, technical aspects of training, allowing coaches to focus on higher-level skills. However, they found key issues in current tools, emphasising the need for personalised, understandable, carefully selected feedback and clear instructional design. Overall, they supported a hybrid model combining traditional coaching with AI-supported exercises.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.04)
- (3 more...)
- Research Report (1.00)
- Questionnaire & Opinion Survey (1.00)
- Instructional Material (1.00)
- Personal > Interview (0.66)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)
- Education > Educational Technology > Educational Software > Computer Based Training (0.46)
- Education > Educational Setting > Online (0.46)
Developing a Robotic Surgery Training System for Wide Accessibility and Research
Shaker, Walid, Erden, Mustafa Suphi
Robotic surgery represents a major breakthrough in medical interventions, which has revolutionized surgical procedures. However, the high cost and limited accessibility of robotic surgery systems pose significant challenges for training purposes. This study addresses these issues by developing a cost-effective robotic laparoscopy training system that closely replicates advanced robotic surgery setups to ensure broad access for both on-site and remote users. Key innovations include the design of a low-cost robotic end-effector that effectively mimics high-end laparoscopic instruments. Additionally, a digital twin platform was established, facilitating detailed simulation, testing, and real-time monitoring, which enhances both system development and deployment. Furthermore, teleoperation control was optimized, leading to improved trajectory tracking while maintaining the remote center of motion (RCM) constraint, with an RMSE of 5 µm and system latency reduced to 0.01 seconds. As a result, the system provides smooth, continuous motion and incorporates essential safety features, making it a highly effective tool for laparoscopic training.
- North America > United States > California > Santa Clara County > Sunnyvale (0.04)
- Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Health & Medicine > Surgery (1.00)
- Health & Medicine > Health Care Technology (1.00)
Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
Lu, Ning, Xie, Qian, Zhang, Hao, Fang, Wenyi, Zheng, Yang, Hu, Zheng, Ma, Jiantao
Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that significantly increase training costs. Despite its significance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
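The TOR definition in the abstract above can be written out directly. The following is a minimal sketch based only on that one-sentence definition; the function names and the assumption that TOR lies in (0, 1] (observed time is never shorter than optimal time) are illustrative, and the paper's failure-specific TOR equations are not reproduced here.

```python
def training_overhead_ratio(optimal_hours: float, observed_hours: float) -> float:
    """TOR = optimal training time / observed training time (per the abstract)."""
    if optimal_hours <= 0 or observed_hours <= 0:
        raise ValueError("training times must be positive")
    return optimal_hours / observed_hours


def estimated_actual_hours(optimal_hours: float, tor: float) -> float:
    """Invert the definition to estimate observed time: optimal / TOR."""
    # Assumption: a system cannot beat the optimal time, so 0 < TOR <= 1.
    if not 0 < tor <= 1:
        raise ValueError("TOR is expected to lie in (0, 1]")
    return optimal_hours / tor


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the paper.
    tor = training_overhead_ratio(optimal_hours=100.0, observed_hours=125.0)
    print(f"TOR = {tor:.2f}")
    print(f"Estimated actual time = {estimated_actual_hours(100.0, tor):.1f} h")
```

Under this reading, a TOR closer to 1 means less time lost to failures and recovery.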
Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
Wang, Yujie, Zhu, Shenhan, Fu, Fangcheng, Miao, Xupeng, Zhang, Jie, Zhu, Juan, Hong, Fan, Li, Yong, Cui, Bin
Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with a unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with speedups of up to 71% compared to state-of-the-art training systems.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (11 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Multi-Agent Training for Pommerman: Curriculum Learning and Population-based Self-Play Approach
Huynh, Nhat-Minh, Cao, Hoang-Giang, Wu, I-Chen
Pommerman is a multi-agent environment that has received considerable attention from researchers in recent years. This environment is an ideal benchmark for multi-agent training, providing a battleground for two teams with communication capabilities among allied agents. Pommerman presents significant challenges for model-free reinforcement learning due to delayed action effects, sparse rewards, and false positives, where opponent players can lose due to their own mistakes. This study introduces a system designed to train multi-agent systems to play Pommerman using a combination of curriculum learning and population-based self-play. We also tackle two challenging problems that arise when deploying the multi-agent training system for competitive games: sparse rewards and the need for a suitable matchmaking mechanism. Specifically, we propose an adaptive annealing factor based on agents' performance to dynamically adjust the dense exploration reward during training. Additionally, we implement a matchmaking mechanism utilizing the Elo rating system to pair agents effectively. Our experimental results demonstrate that our trained agent can outperform top learning agents without requiring communication among allied agents.
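For readers unfamiliar with Elo-based matchmaking, the sketch below shows the standard Elo expected-score and update formulas, plus one simple way to pair agents. The K-factor of 32 and the 400-point scale are conventional Elo defaults, and pairing agents with adjacent ratings is an assumption made for illustration; the abstract does not state which values or pairing rule the authors use.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (A, B) ratings; score_a is 1 for a win, 0.5 draw, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def pair_by_rating(ratings: dict[str, float]) -> list[tuple[str, str]]:
    """Hypothetical pairing rule: sort by rating and match neighbours."""
    ordered = sorted(ratings, key=ratings.get)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


if __name__ == "__main__":
    # Hypothetical population of agents and ratings.
    population = {"a": 1500.0, "b": 1400.0, "c": 1600.0, "d": 1450.0}
    for left, right in pair_by_rating(population):
        # Pretend the lower-rated agent in each pair wins.
        population[left], population[right] = update_elo(
            population[left], population[right], score_a=1.0
        )
    print(population)
```

Pairing neighbours in rating keeps matches close to evenly balanced, which is one common motivation for rating-based matchmaking in self-play populations.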
- Asia > Taiwan (0.04)
- North America > United States > New York (0.04)
- Asia > Thailand (0.04)
- Asia > South Korea (0.04)
- Leisure & Entertainment > Games > Computer Games (0.46)
- Leisure & Entertainment > Games > Chess (0.38)
Graph Neural Network Training Systems: A Performance Comparison of Full-Graph and Mini-Batch
Bajaj, Saurabh, Guan, Hui, Serafini, Marco
Graph Neural Networks (GNNs) have gained significant attention in recent years due to their ability to learn representations of graph-structured data. Two common methods for training GNNs are mini-batch training and full-graph training. Since these two methods require different training pipelines and systems optimizations, two separate categories of GNN training systems emerged, each tailored for one method. Works that introduce systems belonging to a particular category predominantly compare them with other systems within the same category, offering limited or no comparison with systems from the other category. Some prior work also justifies its focus on one specific training method by arguing that it achieves higher accuracy than the alternative. The literature, however, has incomplete and contradictory evidence in this regard. In this paper, we provide a comprehensive empirical comparison of full-graph and mini-batch GNN training systems to get a clearer picture of the state of the art in the field. We find that the mini-batch training systems we consider consistently converge faster than the full-graph training ones across multiple datasets, GNN models, and system configurations, with speedups between 2.4x and 15.2x. We also find that both training techniques converge to similar accuracy values, so comparing systems across the two categories in terms of time-to-accuracy is a sound approach.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
- Europe > Greece (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
HetHub: A Heterogeneous distributed hybrid training system for large-scale models
Xu, Si, Huang, Zixiao, Zeng, Yan, Yan, Shengen, Ning, Xuefei, Ye, Haolin, Gu, Sipei, Shui, Chunsheng, Lin, Zhezheng, Zhang, Hao, Wang, Sheng, Dai, Guohao, Wang, Yu
The development of large-scale models relies on a vast number of computing resources. For example, the GPT-4 model (1.8 trillion parameters) requires 25,000 A100 GPUs for its training. It is a challenge to build a large-scale cluster with a single type of GPU accelerator. Using multiple types of GPU accelerators to construct a cluster is an effective way to solve the problem of insufficient homogeneous GPU accelerators. However, the existing distributed training systems for large-scale models only support homogeneous GPU accelerators, not heterogeneous ones. To address the problem, this paper proposes a distributed training system with hybrid parallelism support on heterogeneous GPU accelerators for large-scale models. It introduces a distributed unified communicator to realize the communication between heterogeneous GPU accelerators, a distributed performance predictor, and an automatic hybrid parallel module to develop and train models efficiently with heterogeneous GPU accelerators. Compared to distributed training systems with homogeneous GPU accelerators, our system supports six different combinations of heterogeneous GPU accelerators, and the optimal performance on heterogeneous GPU accelerators reaches at least 90% of the theoretical upper-bound performance of homogeneous GPU accelerators.
- Information Technology > Hardware (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)